Thesis
Datasets
UCI and Kaggle datasets
- Numerical
  - Binary
    - Less than 10 attributes
    - 10 or more attributes
  - Multiclass
    - Less than 10 attributes
- Binary
- Mixed
  - Binary
Keel datasets
- Imbalanced
  - Binary
    - Imbalance ratio between 1.5 and 9
    - Imbalance ratio higher than 9
  - Multiclass
- Binary
- Noisy
  - [cn] Class noise
  - [an] Attribute noise
    - [an_nn] noisy train, noisy test
      - 5% noise
      - 20% noise
    - [an_nc] noisy train, clean test
      - 5% noise
      - 20% noise
    - [an_cn] clean train, noisy test
      - 5% noise
      - 20% noise
Numerical
Binary
< 10
Banknote authentication
Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A wavelet transform tool was used to extract features from the images.
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows: 1372
- Number of attributes: 4
Description of the attributes:
- variance: variance of the wavelet-transformed image (numerical)
- skewness: skewness of the wavelet-transformed image (numerical)
- curtosis: kurtosis of the wavelet-transformed image (numerical)
- entropy: entropy of the image (numerical)
- class:
Data
# A tibble: 1,372 x 5
class variance skewness curtosis entropy
<fct> <dbl> <dbl> <dbl> <dbl>
1 0 3.62 8.67 -2.81 -0.447
2 0 4.55 8.17 -2.46 -1.46
3 0 3.87 -2.64 1.92 0.106
4 0 3.46 9.52 -4.01 -3.59
5 0 0.329 -4.46 4.57 -0.989
6 0 4.37 9.67 -3.96 -3.16
7 0 3.59 3.01 0.729 0.564
8 0 2.09 -6.81 8.46 -0.602
9 0 3.20 5.76 -0.753 -0.613
10 0 1.54 9.18 -2.27 -0.735
# … with 1,362 more rows
Haberman
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
- Source: UCI Machine Learning Repository
- Number of rows: 306
- Number of attributes: 3
- Classification: binary
- Input features: numerical
Description of the attributes:
- age: Age of patient at time of operation (numerical)
- year: Patient’s year of operation (numerical)
- nodes: Number of positive axillary nodes detected (numerical)
- class: Survival status (class attribute)
  - 1 = the patient survived 5 years or longer [positive]
  - 2 = the patient died within 5 years
Skin segmentation
The skin dataset was collected by randomly sampling B, G, R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders, obtained from the FERET database and the PAL database. The total learning sample size is 245057, of which 50859 are skin samples and 194198 are non-skin samples. The source databases are the Color FERET Image Database and the PAL Face Database from the Productive Aging Laboratory, The University of Texas at Dallas.
- Source: UCI Machine Learning Repository
- Number of rows: 245057
- Number of attributes: 3
- Classification: binary
- Input features: numerical
Description of the attributes:
- red: numerical
- green: numerical
- blue: numerical
- class:
  - 1: it is a skin sample [positive]
  - 2: it is not a skin sample
Vertebral column 2
Biomedical data set built by Dr. Henrique da Mota during a medical residency period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The task consists of classifying patients as belonging to one of two categories: Normal (100 patients) or Abnormal (210 patients). Files are also provided for use within the WEKA environment.
The same data also supports a three-class task: classifying patients as belonging to one of three categories, Normal (100 patients), Disk Hernia (60 patients) or Spondylolisthesis (150 patients). In the two-class version, Disk Hernia and Spondylolisthesis are merged into the Abnormal category.
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows: 310
- Number of attributes: 6
Description of the attributes:
>= 10
Audit risk
Many risk factors are examined from various areas, such as past records of the audit office, audit paras, environmental conditions reports, firm reputation summaries, ongoing issues reports, profit-value records, loss-value records and follow-up reports. After in-depth interviews with the auditors, important risk factors are evaluated and their probability of existence is calculated from the present and past records.
The goal of the research is to help the auditors by building a classification model that can predict fraudulent firms on the basis of present and historical risk factors. The sectors and their firm counts are: Irrigation (114), Public Health (77), Buildings and Roads (82), Forest (70), Corporate (47), Animal Husbandry (95), Communication (1), Electrical (4), Land (5), Science and Technology (3), Tourism (1), Fisheries (41), Industries (37), Agriculture (200).
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows: 776
- Number of attributes: 24
Description of the attributes:
- att1: numerical
- att2: numerical
- att3: numerical
- att4: numerical
- att5: numerical
- att6: numerical
- att7: categorical
- class:
  - Abnormal: [positive]
  - Normal:
EEG eye state
All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added manually to the file afterwards, after analysing the video frames. ‘1’ indicates the eye-closed state and ‘0’ the eye-open state. All values are in chronological order, with the first measured value at the top of the data.
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows:
- Number of attributes:
Description of the attributes:
Ionosphere
This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. See the paper for more details. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere. “Bad” returns are those that do not; their signals pass through the ionosphere.
Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows: 351
- Number of attributes: 32
Description of the attributes:
- X1-X34: numerical
- class:
  - Bad: [positive]
  - Good:
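The description above says each pulse number contributes two attributes, taken from one complex autocorrelation value. A minimal sketch of that layout follows; the function name, the interleaved (real, imaginary) ordering and the example values are assumptions made for illustration, not details confirmed by the source.

```python
def flatten_pulses(acf_values):
    """Turn one complex value per pulse number into two real attributes.

    The (real, imaginary) interleaving is an assumed ordering.
    """
    attributes = []
    for value in acf_values:
        attributes.append(value.real)
        attributes.append(value.imag)
    return attributes


# Hypothetical autocorrelation values for the 17 pulse numbers (not real radar data).
example = [complex(1.0 - 0.05 * k, 0.01 * k) for k in range(17)]
row = flatten_pulses(example)
print(len(row))  # two attributes per pulse number
```

With 17 pulse numbers this yields 17 x 2 numerical attributes per instance, before the class label is attached.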
Sonar
The file “sonar.mines” contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The file “sonar.rocks” contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.
The label associated with each record contains the letter “R” if the object is a rock and “M” if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.
- Source: UCI Machine Learning Repository
- Classification: binary
- Input features: numerical
- Number of rows: 208
- Number of attributes: 60
Description of the attributes:
- V1-V60: numerical
- class:
  - M: [positive]
  - R:
Multiclass
< 10
Ecoli
Description of the dataset
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows: 336
- Number of attributes: 7
Description of the attributes:
- mcg: numerical
- gvh: numerical
- lip: numerical
- chg: numerical
- aac: numerical
- alm1: numerical
- alm2: categorical
- class:
  - cp
  - im
  - imS
  - imL
  - imU
  - om
  - omL
  - pp
Iris
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Predicted attribute: class of iris plant.
This is an exceedingly simple domain.
This data differs from the data presented in Fisher’s article (identified by Steve Chadwick, spchadwick ‘@’ espeedaz.net). The 35th sample should be 4.9,3.1,1.5,0.2,“Iris-setosa”, where the error is in the fourth feature. The 38th sample should be 4.9,3.6,1.4,0.1,“Iris-setosa”, where the errors are in the second and third features.
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows: 150
- Number of attributes: 4
Description of the attributes:
- Sepal.Length: numerical
- Sepal.Width: numerical
- Petal.Length: numerical
- Petal.Width: numerical
- class:
  - Iris-setosa
  - Iris-versicolor
  - Iris-virginica
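The two corrections to the 35th and 38th samples quoted in the description can be applied programmatically. The following is a minimal sketch; the function name is hypothetical, and it assumes the rows are held in the original file order as (sepal length, sepal width, petal length, petal width, class) tuples.

```python
# Corrected records as quoted in the description (1-based sample number -> row).
FISHER_CORRECTIONS = {
    35: (4.9, 3.1, 1.5, 0.2, "Iris-setosa"),
    38: (4.9, 3.6, 1.4, 0.1, "Iris-setosa"),
}


def apply_fisher_corrections(rows):
    """Return a copy of rows with the 35th and 38th samples replaced.

    Assumes rows are in the original file order; the input is not modified.
    """
    if len(rows) < max(FISHER_CORRECTIONS):
        raise ValueError("expected at least 38 rows in file order")
    fixed = [tuple(r) for r in rows]
    for sample_number, corrected in FISHER_CORRECTIONS.items():
        fixed[sample_number - 1] = corrected  # convert to 0-based index
    return fixed
```

Returning a copy keeps the raw data intact, so results with and without the corrections can be compared.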
Life expectancy
This dataset contains 6 columns and 223 rows. Each row corresponds to a country, in order of life expectancy rank. The dataset has three numeric columns: Overall Life Expectancy, Male Life Expectancy and Female Life Expectancy. The last column is Continent, the continent in which the country lies; it can be used as the class for the data.
This data can be used for classification with techniques such as SVM (linear), KNN and C4.5, as well as other supervised and unsupervised techniques.
- Source: Kaggle
- Classification: multiclass
- Input features: numerical
- Number of rows: 223
- Number of attributes: 3
Description of the attributes:
- overall: numerical
- male: numerical
- female: numerical
- class:
  - Europe
  - Asia
  - Oceania
  - North America
  - Africa
  - South America
Seeds
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
The data set can be used for the tasks of classification and cluster analysis.
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows: 210
- Number of attributes: 7
Description of the attributes:
- area: numerical
- perimeter: numerical
- compactness: C = 4*pi*A/P^2 (A = area, P = perimeter), numerical
- length_kernel: numerical
- width_kernel: numerical
- asymmetry_coefficient: numerical
- length_kernel_groove: numerical
- class:
  - 1
  - 2
  - 3
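The compactness attribute is derived from two of the other attributes, so it can be recomputed as a consistency check. The sketch below implements the formula C = 4*pi*A/P^2 given in the attribute list; the function name and the example values are illustrative assumptions, not taken from the dataset.

```python
import math


def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2 for a shape with area A and perimeter P."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# Sanity check on a circle of radius 2 (hypothetical values, not a kernel measurement).
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2.0 * math.pi * r)
print(round(circle_c, 6))
```

A circle maximizes this measure, so values for real kernels are expected to fall below the circle's value.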
Vertebral column 3
See Vertebral column 2 for reference, although a description still needs to be written here.
Wifi localization
Collected to perform experimentation on how wifi signal strengths can be used to determine one of the indoor locations.
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows: 2000
- Number of attributes: 7
Description of the attributes:
- att1: numerical
- att2: numerical
- att3: numerical
- att4: numerical
- att5: numerical
- att6: numerical
- class:
  - 1
  - 2
  - 3
  - 4
Yeast
Description of the dataset
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows: 1484
- Number of attributes: 8
The original has one more attribute, Sequence Name: accession number for the SWISS-PROT database.
Description of the attributes:
- mcg: McGeoch’s method for signal sequence recognition (numerical)
- gvh: von Heijne’s method for signal sequence recognition (numerical)
- alm: Score of the ALOM membrane spanning region prediction program (numerical)
- mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins (numerical)
- erl: Presence of “HDEL” substring (thought to act as a signal for retention in the endoplasmic reticulum lumen); binary attribute (numerical)
- pox: Peroxisomal targeting signal in the C-terminus (numerical)
- vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins (numerical)
- nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins (numerical)
- class:
  - CYT
  - ERL
  - EXC
  - ME1
  - ME2
  - ME3
  - MIT
  - NUC
  - POX
  - VAC
>= 10
Mixed
Binary
< 10
acute_inflammations1
acute_inflammations2
caesarian
mini_mammographic_masses
>= 10
statlog
Multiclass
< 10
abalone
teaching_assistant
>= 10
contraceptive
Categorical
Binary
balance_scale
breast_cancer
mini_cars
somerville
mini_tic_tac_toe
Multiclass
post_operative
mini_connect4
soybean_large
zoo
Keel
Imbalanced
Imbalanced data sets are a special case of classification problem in which the class distribution is not uniform. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.
Binary
Binary: Imbalance ratio between 1.5 and 9
# A tibble: 7 x 5
name instances features classes proportion
<chr> <int> <dbl> <int> <chr>
1 imb_ecoli_0_vs_1 220 7 2 [0.35/0.65]
2 imb_glass0 214 9 2 [0.67/0.33]
3 imb_glass1 214 9 2 [0.64/0.36]
4 imb_glass6 214 9 2 [0.86/0.14]
5 imb_haberman 306 3 2 [0.74/0.26]
6 imb_iris0 150 4 2 [0.67/0.33]
7 imb_wisconsin 683 9 2 [0.65/0.35]
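The grouping by imbalance ratio can be checked directly from the proportions in the summary table. The sketch below assumes the imbalance ratio is the majority share divided by the minority share; that definition, the function names and the "below 1.5" label are assumptions for illustration. Because the proportions are rounded to two decimals, the resulting ratios are approximate.

```python
# Class proportions copied from the summary table (name -> the two class shares).
PROPORTIONS = {
    "imb_ecoli_0_vs_1": (0.35, 0.65),
    "imb_glass0": (0.67, 0.33),
    "imb_glass1": (0.64, 0.36),
    "imb_glass6": (0.86, 0.14),
    "imb_haberman": (0.74, 0.26),
    "imb_iris0": (0.67, 0.33),
    "imb_wisconsin": (0.65, 0.35),
}


def imbalance_ratio(shares):
    """Majority share divided by minority share (assumed definition)."""
    return max(shares) / min(shares)


def ir_group(ir):
    """Bucket matching the section headings: 1.5 to 9, or higher than 9."""
    if ir > 9:
        return "higher than 9"
    if ir >= 1.5:
        return "between 1.5 and 9"
    return "below 1.5"


ratios = {name: imbalance_ratio(p) for name, p in PROPORTIONS.items()}
for name, ir in ratios.items():
    print(f"{name}: IR = {ir:.2f} ({ir_group(ir)})")
```

Under this definition, every data set in the table falls in the 1.5 to 9 group, which matches the heading it is listed under.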
imb_ecoli_0_vs_1
class Mcg Gvh Lip Chg Aac Alm1 Alm2
1 positive 0.49 0.29 0.48 0.5 0.56 0.24 0.35
2 positive 0.07 0.40 0.48 0.5 0.54 0.35 0.44
3 positive 0.56 0.40 0.48 0.5 0.49 0.37 0.46
4 positive 0.59 0.49 0.48 0.5 0.52 0.45 0.36
5 positive 0.23 0.32 0.48 0.5 0.55 0.25 0.35
6 positive 0.67 0.39 0.48 0.5 0.36 0.38 0.46
7 positive 0.29 0.28 0.48 0.5 0.44 0.23 0.34
8 positive 0.21 0.34 0.48 0.5 0.51 0.28 0.39
9 positive 0.20 0.44 0.48 0.5 0.46 0.51 0.57
[ reached 'max' / getOption("max.print") -- omitted 211 rows ]
imb_glass0
class RI Na Mg Al Si K Ca Ba Fe
1 positive 1.515888 12.87795 3.43036 1.40066 73.2820 0.68931 8.04468 0 0.1224
2 positive 1.517642 12.97770 3.53812 1.21127 73.0020 0.65205 8.52888 0 0.0000
3 positive 1.522130 14.20795 3.82099 0.46976 71.7700 0.11178 9.57260 0 0.0000
4 positive 1.522221 13.21045 3.77160 0.79076 71.9884 0.13041 10.24520 0 0.0000
5 positive 1.517551 13.39000 3.65935 1.18880 72.7892 0.57132 8.27064 0 0.0561
6 positive 1.520991 13.68925 3.59200 1.12139 71.9604 0.08694 9.40044 0 0.0000
7 positive 1.517551 13.15060 3.60996 1.05077 73.2372 0.57132 8.23836 0 0.0000
[ reached 'max' / getOption("max.print") -- omitted 207 rows ]
imb_glass1
class RI Na Mg Al Si K Ca Ba Fe
1 negative 1.515888 12.87795 3.43036 1.40066 73.2820 0.68931 8.04468 0 0.1224
2 negative 1.517642 12.97770 3.53812 1.21127 73.0020 0.65205 8.52888 0 0.0000
3 negative 1.522130 14.20795 3.82099 0.46976 71.7700 0.11178 9.57260 0 0.0000
4 negative 1.522221 13.21045 3.77160 0.79076 71.9884 0.13041 10.24520 0 0.0000
5 negative 1.517551 13.39000 3.65935 1.18880 72.7892 0.57132 8.27064 0 0.0561
6 negative 1.520991 13.68925 3.59200 1.12139 71.9604 0.08694 9.40044 0 0.0000
7 negative 1.517551 13.15060 3.60996 1.05077 73.2372 0.57132 8.23836 0 0.0000
[ reached 'max' / getOption("max.print") -- omitted 207 rows ]
imb_glass6
class RI Na Mg Al Si K Ca Ba Fe
1 negative 1.515888 12.87795 3.43036 1.40066 73.2820 0.68931 8.04468 0 0.1224
2 negative 1.517642 12.97770 3.53812 1.21127 73.0020 0.65205 8.52888 0 0.0000
3 negative 1.522130 14.20795 3.82099 0.46976 71.7700 0.11178 9.57260 0 0.0000
4 negative 1.522221 13.21045 3.77160 0.79076 71.9884 0.13041 10.24520 0 0.0000
5 negative 1.517551 13.39000 3.65935 1.18880 72.7892 0.57132 8.27064 0 0.0561
6 negative 1.520991 13.68925 3.59200 1.12139 71.9604 0.08694 9.40044 0 0.0000
7 negative 1.517551 13.15060 3.60996 1.05077 73.2372 0.57132 8.23836 0 0.0000
[ reached 'max' / getOption("max.print") -- omitted 207 rows ]
imb_haberman
class Age Year Positive
1 negative 8 2 13
2 negative 9 6 24
3 negative 19 5 2
4 negative 23 3 13
5 negative 17 11 24
6 negative 26 10 1
7 negative 34 1 1
8 negative 25 12 16
9 positive 15 9 1
10 negative 22 4 1
11 negative 31 8 30
12 negative 24 5 1
13 negative 14 4 1
14 negative 19 4 1
15 negative 25 9 11
16 negative 13 6 7
17 negative 7 2 28
18 negative 12 3 2
[ reached 'max' / getOption("max.print") -- omitted 288 rows ]
imb_iris0
class SepalLength SepalWidth PetalLength PetalWidth
1 positive 5.1 3.5 1.4 0.2
2 positive 4.9 3.0 1.4 0.2
3 positive 4.6 3.1 1.5 0.2
4 positive 5.0 3.6 1.4 0.2
5 positive 5.4 3.9 1.7 0.4
6 positive 4.6 3.4 1.4 0.3
7 positive 5.0 3.4 1.5 0.2
8 positive 4.4 2.9 1.4 0.2
9 positive 5.4 3.7 1.5 0.2
10 positive 4.8 3.4 1.6 0.2
11 positive 4.8 3.0 1.4 0.1
12 positive 4.3 3.0 1.1 0.1
13 positive 5.7 4.4 1.5 0.4
14 positive 5.4 3.9 1.3 0.4
15 positive 5.1 3.5 1.4 0.3
[ reached 'max' / getOption("max.print") -- omitted 135 rows ]
imb_wisconsin
class ClumpThickness CellSize CellShape MarginalAdhesion EpithelialSize BareNuclei BlandChromatin NormalNucleoli Mitoses
1 negative 6 1 1 1 3 1 4 1 1
2 negative 6 5 5 6 8 2 4 3 1
3 negative 4 1 1 1 3 3 4 1 1
4 negative 7 9 9 1 4 5 4 8 1
5 negative 5 1 1 4 3 1 4 1 1
6 positive 9 2 2 9 8 2 10 8 1
7 negative 1 1 1 1 3 2 4 1 1
[ reached 'max' / getOption("max.print") -- omitted 676 rows ]
Binary: Imbalance ratio higher than 9
Multiclass
Noisy
Class noise
Attribute noise
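The outline distinguishes three attribute-noise variants ([an_nn] noisy train and noisy test, [an_nc] noisy train and clean test, [an_cn] clean train and noisy test), each at 5% or 20% noise. The sketch below shows one way such variants could be produced. The corruption rule (replace the chosen values with a uniform draw from the attribute's observed range) and all function names are assumptions for illustration; the source does not state how the noise was generated. Rows are assumed to hold numerical attributes only, without the class label.

```python
import random

# Which splits receive noise under each scheme tag from the outline.
SCHEMES = {
    "an_nn": (True, True),   # noisy train, noisy test
    "an_nc": (True, False),  # noisy train, clean test
    "an_cn": (False, True),  # clean train, noisy test
}


def add_attribute_noise(rows, level, rng):
    """Return a copy of rows in which, for each attribute, a fraction `level`
    of the values is replaced by a uniform draw from that attribute's
    observed [min, max] range (assumed corruption rule)."""
    noisy = [list(r) for r in rows]
    if not noisy:
        return noisy
    n_rows, n_cols = len(noisy), len(noisy[0])
    n_corrupt = round(level * n_rows)
    for col in range(n_cols):
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        for i in rng.sample(range(n_rows), n_corrupt):
            noisy[i][col] = rng.uniform(lo, hi)
    return noisy


def apply_scheme(train, test, scheme, level, seed=0):
    """Apply one scheme (e.g. "an_nc") at a noise level such as 0.05 or 0.20."""
    rng = random.Random(seed)
    noisy_train, noisy_test = SCHEMES[scheme]
    new_train = add_attribute_noise(train, level, rng) if noisy_train else [list(r) for r in train]
    new_test = add_attribute_noise(test, level, rng) if noisy_test else [list(r) for r in test]
    return new_train, new_test
```

For example, `apply_scheme(train, test, "an_nc", 0.20)` would corrupt about 20% of the values of each training attribute and leave the test split untouched.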
“mini_chess”
Template
Description of the dataset
- Source: UCI Machine Learning Repository
- Classification: multiclass
- Input features: numerical
- Number of rows:
- Number of attributes:
Description of the attributes: